Abstract

Metric learning has attracted a lot of interest over the last decade, but little work has been done
on the generalization ability of such methods. In this paper, we address this issue by proposing
an adaptation of the notion of algorithmic robustness, previously introduced by Xu and Mannor in
classic supervised learning, to derive generalization bounds for metric learning. We also show that
a weak notion of robustness is a necessary and sufficient condition to generalize, justifying that it is
fundamental for metric learning. We provide some illustrative examples of our approach on a large
class of existing algorithms.

Keywords: Metric learning, Algorithmic robustness, Generalization bounds.
1 Introduction

The past ten years have seen a growing interest in supervised metric learning. Indeed, the relevance of a
distance or a similarity, for a given task, is of crucial importance to the effectiveness of many classification
or clustering methods. For this reason, a lot of research has been devoted to automatically learning
distances or similarities from supervised data. Existing approaches rely on the fairly reasonable principle
that, according to a good metric, pairs of examples with the same (resp. different) labels must be close
to each other (resp. far away). Learning thus generally consists in finding the best parameters of the
metric function given a set of labeled pairs.¹ The most classic and commonly used approach in the
literature focuses on Mahalanobis distance learning where the objective is to learn a positive semi-definite
(PSD) matrix [...] inducing a linear projection of the data where the Euclidean distance
performs well. Other approaches have also considered arbitrary similarity functions with no PSD constraint
[...]. The learned distance or similarity is then typically used to improve the performance of nearest-neighbor
methods.

From a theoretical standpoint, many papers have studied the convergence rate of the optimization problem
used to learn the parameters of the metric. However, and somewhat surprisingly, few studies have
addressed the generalization ability of learned metrics on unseen data. This situation can be explained
by the fact that one cannot assume that the learning pairs provided to a metric learning algorithm are
independent and identically distributed (IID). Indeed, these pairs are generally given by an expert and/or
extracted from a sample of individual instances. For example, common procedures for building such
learning pairs are based either on the k nearest or farthest neighbors of each example, some criterion of
diversity [13], taking all the possible pairs or drawing pairs randomly from a learning sample. Online
methods [...] nevertheless offer guarantees, but only in the form of regret bounds assessing the
deviation between the cumulative loss suffered by the online algorithm and the loss induced by the best
hypothesis that can be chosen in hindsight. Apart from these results, as far as we know, very few papers
have proposed a theoretical study on the generalization ability of supervised metric learning methods.



The approach of Bian and Tao [13] uses a statistical analysis to give generalization guarantees for loss
minimization methods, but their results assume some hypotheses on the distribution of the examples 
and do not take into account any regularization on the metric. The most general contribution has been 



*We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02 project and the PASCAL 2 Network 
of Excellence. 

1 These pairs are sometimes replaced by triplets (x, y, z) such that example x must be closer to example y than to example
z, where x and y share the same label and z does not.
proposed by Jin et al. [14] who adapted the framework of uniform stability [15] to regularized metric
learning. However, their approach is based on a Frobenius norm regularizer and cannot be applied to any
type of regularization, in particular sparsity-inducing norms [16].

In this paper, we propose to address this lack of theoretical framework by studying the generalization ability
of metric learning algorithms according to a notion of algorithmic robustness. Algorithmic robustness,
introduced by Xu et al. [17, 18], allows one to derive generalization bounds when, given two "close"
training and testing examples, the variation between their associated losses is bounded. This notion of
closeness of examples relies on a partition of the input space into different regions such that two examples
in the same region are said to be close. This framework has been successfully used in the classic supervised
learning setting for deriving generalization bounds for SVM, Lasso and more. We propose here to adapt
this notion of algorithmic robustness to metric learning in a way that works both for similarity and distance learning.
We show that, in this context, the problem of non-IIDness of the learning pairs can be worked around
by simply assuming that the pairs are built from an IID sample of labeled examples. Moreover, following
the work of Xu et al. [18], we provide a notion of weak robustness that is necessary and sufficient for
metric learning algorithms to generalize well, highlighting that robustness is a fundamental property. We
illustrate the applicability of our framework by deriving generalization bounds, using very few approach-specific
arguments, for a larger class of problems than Jin et al. that can accommodate a vast choice of
regularizers, without any assumption on the distribution of the examples.

The rest of the paper is organized as follows. We introduce some preliminaries and notations in Section 2.
Our notion of algorithmic robustness for metric learning is presented in Section 3. The necessity and
sufficiency of weak robustness is shown in Section 4. Section 5 is devoted to the illustration of our
framework on actual metric learning algorithms. Finally, we conclude in Section 6.



2 Preliminaries 

2.1 Notations 

Let $X$ be the instance space, $Y$ be a finite label set and let $Z = X \times Y$. In the following, $z = (x, y) \in Z$
means $x \in X$ and $y \in Y$. Let $\mu$ be an unknown probability distribution over $Z$. We assume that $X$ is a
compact convex metric space w.r.t. a norm $\|\cdot\|$ such that $X \subset \mathbb{R}^d$, thus there exists a constant $R$ such
that $\forall x \in X$, $\|x\| \le R$. A similarity or distance function is a pairwise function $f : X \times X \to \mathbb{R}$. In the
following, we use the generic term metric to refer to either a similarity or a distance function. We denote
by $s$ a labeled training sample consisting of $n$ training instances $(s_1, \ldots, s_n)$ drawn IID from $\mu$. The sample
of all possible pairs built from $s$ is denoted by $p_s$ such that $p_s = \{(s_1, s_1), \ldots, (s_1, s_n), \ldots, (s_n, s_n)\}$. A
metric learning algorithm $A$ takes as input a finite set of pairs from $(Z \times Z)^n$ and outputs a metric. We
denote by $A_{p_s}$ the metric learned by an algorithm $A$ from a sample $p_s$ of pairs. For any pair of labeled
examples $(z, z')$ and any metric $f$, we associate a loss function $l(f, z, z')$ which depends on the examples
and their labels. This loss is assumed to be nonnegative and uniformly bounded by a constant $B$. We
define the true generalization loss over $\mu$ by $\mathcal{L}(f) = \mathbb{E}_{z, z' \sim \mu}\, l(f, z, z')$. We denote the empirical loss over
the sample $p_s$ by $l_{emp}(f) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} l(f, s_i, s_j) = \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} l(f, s_i, s_j)$.
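The empirical loss over all $n^2$ ordered pairs can be sketched in a few lines of Python. This is only an illustration of the definition: the metric `squared_euclidean`, the bounded loss `toy_loss` and the three-point sample are hypothetical choices that do not come from the paper.

```python
from itertools import product

def empirical_loss(loss, metric, sample):
    """Average of loss(metric, s_i, s_j) over all n^2 ordered pairs of the sample."""
    n = len(sample)
    total = sum(loss(metric, si, sj) for si, sj in product(sample, repeat=2))
    return total / (n * n)

# Hypothetical metric, used only for illustration.
def squared_euclidean(x, x2):
    return sum((a - b) ** 2 for a, b in zip(x, x2))

# Hypothetical loss: nonnegative and bounded by B, as the notation section requires.
def toy_loss(metric, z, z2, B=1.0):
    (x, y), (x2, y2) = z, z2
    d = min(metric(x, x2), B)        # clip so that the loss stays in [0, B]
    return d if y == y2 else B - d   # same label: want small d; different label: want large d

# Each labeled instance is z = (x, y).
sample = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 1.0), 1)]
value = empirical_loss(toy_loss, squared_euclidean, sample)
```

Note that the pairs $(s_i, s_i)$ are included, exactly as in the definition of $p_s$.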



2.2 Robustness for classical supervised learning 



The notion of algorithmic robustness, introduced by Xu and Mannor [17, 18] in the context of classic
supervised learning, is based on the deviation between the losses associated to two training and testing
instances that are close. An algorithm is said to be $(K, \epsilon(s))$-robust if there exists a partition of the space
$Z = X \times Y$ into $K$ disjoint subsets such that for every learning and testing instances belonging to the
same region of the partition, the deviation between their associated losses is bounded by a term $\epsilon(s)$. From
this definition, the authors have proved a convergence bound for the difference between the empirical and
true losses of the form $\epsilon(s) + B\sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}$ (with probability $1 - \delta$). This bound depends on $K$ and
$\epsilon(s)$ which can be made as small as desired by refining this partition. When considering metric spaces,
the partition of $Z$ can be obtained by the notion of covering number [19].
Definition 1 For a metric space $(X, \rho)$, and $T \subset X$, we say that $\hat{T} \subset T$ is a $\gamma$-cover of $T$, if $\forall t \in T$,
$\exists \hat{t} \in \hat{T}$ such that $\rho(t, \hat{t}) \le \gamma$. The $\gamma$-covering number of $T$ is

$$\mathcal{N}(\gamma, T, \rho) = \min\{|\hat{T}| : \hat{T} \text{ is a } \gamma\text{-cover of } T\}.$$

For example, when $X$ is a compact convex space, for any $\gamma > 0$, the quantity $\mathcal{N}(\gamma, X, \rho)$ is finite, leading
to a finite cover. If we consider the space $Z$, we can note that the label set can be partitioned into $|Y|$
sets. Thus, $Z$ can be partitioned into $|Y|\,\mathcal{N}(\gamma, X, \rho)$ subsets such that if two instances $z_1 = (x_1, y_1)$,
$z_2 = (x_2, y_2)$ belong to the same subset, then $y_1 = y_2$ and $\rho(x_1, x_2) \le \gamma$.
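To make the construction concrete, the following sketch builds a cover greedily on a finite sample and partitions labeled instances by (nearest center, label). It is an illustration under stated assumptions rather than the paper's procedure: the greedy rule is just one simple way to obtain a cover (not a minimal one), it covers the sample rather than all of $X$, and it uses radius $\gamma/2$ so that two instances in the same cell are at distance at most $\gamma$ by the triangle inequality. The function names are hypothetical.

```python
import math
import random

def greedy_cover(points, radius):
    """Greedily pick centers so that every point lies within `radius` of some center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

def cell_of(z, centers):
    """Cell of a labeled instance z = (x, y): (index of nearest center, label)."""
    x, y = z
    nearest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    return nearest, y

rng = random.Random(0)
gamma = 0.5
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
sample = [(x, rng.choice([0, 1])) for x in xs]

centers = greedy_cover(xs, gamma / 2)
cells = {}
for z in sample:
    cells.setdefault(cell_of(z, centers), []).append(z)

# Largest distance between two instances that share a cell.
max_diameter = max(
    math.dist(a[0], b[0]) for members in cells.values() for a in members for b in members
)
```

The number of non-empty cells is at most the number of labels times the number of centers, mirroring the $|Y|\,\mathcal{N}(\cdot, X, \rho)$ count above.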



3 Robustness and Generalization for Metric Learning 

We present here our adaptation of robustness to metric learning. The idea is to use the partition of $Z$
at the pair level: if a new test pair of examples is close to a learning pair, then the losses of the two
pairs must be close. Two pairs are close when each instance of the first pair falls into the same subset of
the partition of $Z$ as the corresponding instance of the other pair. A metric learning algorithm with this
property is said to be robust. This notion is formalized as follows.

Definition 2 An algorithm $A$ is $(K, \epsilon(\cdot))$ robust for $K \in \mathbb{N}$ and $\epsilon(\cdot) : (Z \times Z)^n \to \mathbb{R}$ if $Z$ can be partitioned
into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$, such that for all samples $s \in Z^n$ and the pair set $p(s)$ associated
to this sample, the following holds:

$\forall (s_1, s_2) \in p(s)$, $\forall z_1, z_2 \in Z$, $\forall i, j = 1, \ldots, K$: if $s_1, z_1 \in C_i$ and $s_2, z_2 \in C_j$ then

$$|l(A_{p_s}, s_1, s_2) - l(A_{p_s}, z_1, z_2)| \le \epsilon(p_s). \quad (1)$$



K and e(-) quantify the robustness of the algorithm which depends on the learning sample. The property 
of robustness is required for every training pair of the sample; we will see later that this property can be 
relaxed. 

Note that this definition of robustness can be easily extended to triplet based metric learning algorithms.
Instead of considering all the pairs $p_s$ from an IID sample $s$, we take the admissible triplet set $trip_s$ of $s$
such that $(s_1, s_2, s_3) \in trip_s$ means $s_1$ and $s_2$ share the same label while $s_1$ and $s_3$ have different ones,
with the interpretation that $s_1$ must be more similar to $s_2$ than to $s_3$. The robustness property can then
be expressed by: $\forall (s_1, s_2, s_3) \in trip_s$, $\forall z_1, z_2, z_3 \in Z$, $\forall i, j, l = 1, \ldots, K$: if $s_1, z_1 \in C_i$, $s_2, z_2 \in C_j$ and
$s_3, z_3 \in C_l$ then

$$|l(A_{trip_s}, s_1, s_2, s_3) - l(A_{trip_s}, z_1, z_2, z_3)| \le \epsilon(trip_s). \quad (2)$$



3.1 Generalization of robust algorithms 

We now give a PAC generalization bound for metric learning algorithms fulfilling the property of robustness
(Definition 2). We first begin by presenting a concentration inequality that will help us to derive the bound.



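Proposition 1 ([20]) Let $(|N_1|, \ldots, |N_K|)$ be an IID multinomial random variable with parameters $n$ and
$(\mu(C_1), \ldots, \mu(C_K))$. By the Bretagnolle-Huber-Carol inequality we have:
$\Pr\left\{ \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \ge \lambda \right\} \le 2^K \exp\left( \frac{-n\lambda^2}{2} \right)$, hence with probability at least $1 - \delta$,

$$\sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \quad (3)$$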
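The concentration statement of Equation (3) can be checked empirically with a small simulation. The sketch below is illustrative only: the cell probabilities `probs`, the sample size and the number of trials are arbitrary assumptions, and the helper names are hypothetical.

```python
import math
import random

def bhc_bound(K, n, delta):
    """Right-hand side of Equation (3): sqrt((2K ln 2 + 2 ln(1/delta)) / n)."""
    return math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)

def l1_deviation(probs, n, rng):
    """Draw n IID cell indices and return sum_i | |N_i|/n - mu(C_i) |."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return sum(abs(c / n - p) for c, p in zip(counts, probs))

rng = random.Random(0)
probs = [0.4, 0.3, 0.2, 0.1]      # assumed values of mu(C_1), ..., mu(C_K)
n, delta, trials = 500, 0.05, 200

bound = bhc_bound(len(probs), n, delta)
# Number of trials in which the deviation exceeds the bound; the proposition
# says this should happen with probability at most delta.
violations = sum(l1_deviation(probs, n, rng) > bound for _ in range(trials))
```

In practice the bound is loose for small $K$, so violations are expected to be much rarer than a fraction $\delta$ of the trials.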



We now give our first result on the generalization of metric learning algorithms. 

Theorem 1 If a learning algorithm $A$ is $(K, \epsilon(\cdot))$-robust and the training sample is made of the pairs $p_s$
obtained from a sample $s$ generated by $n$ IID draws from $\mu$, then for any $\delta > 0$, with probability at least
$1 - \delta$ we have:

$$|\mathcal{L}(A_{p_s}) - l_{emp}(A_{p_s})| \le \epsilon(p_s) + 2B \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.$$
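To get a feel for the orders of magnitude in Theorem 1, the right-hand side can be evaluated numerically and inverted for $n$. This is a hedged sketch: the function names and the numeric values used in the example are assumptions chosen for illustration.

```python
import math

def robustness_bound(eps, B, K, n, delta):
    """Right-hand side of Theorem 1: eps + 2B sqrt((2K ln 2 + 2 ln(1/delta)) / n)."""
    return eps + 2 * B * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)

def samples_needed(target, B, K, delta):
    """Smallest n for which the sampling term 2B sqrt(...) is at most `target`."""
    c = 2 * K * math.log(2) + 2 * math.log(1 / delta)
    return math.ceil(4 * B ** 2 * c / target ** 2)

# Example with assumed values: loss bounded by B = 1, K = 10 cells, delta = 0.05.
n_required = samples_needed(0.1, 1.0, 10, 0.05)
bound_at_n = robustness_bound(0.0, 1.0, 10, n_required, 0.05)
```

The sampling term grows like $\sqrt{K}$, so refining the partition (larger $K$, smaller $\epsilon$) trades one term of the bound against the other.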
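Proof Let $N_i$ be the set of indices of the points of $s$ that fall into $C_i$. $(|N_1|, \ldots, |N_K|)$ is an IID multinomial random
variable with parameters $n$ and $(\mu(C_1), \ldots, \mu(C_K))$. We have:

$$
\begin{aligned}
&|\mathcal{L}(A_{p_s}) - l_{emp}(A_{p_s})| \\
&= \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \mu(C_i)\mu(C_j) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} l(A_{p_s}, s_i, s_j) \Big| \\
&\overset{(a)}{\le} \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \mu(C_i)\mu(C_j) - \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \mu(C_i) \frac{|N_j|}{n} \Big| \\
&\quad + \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \mu(C_i) \frac{|N_j|}{n} - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} l(A_{p_s}, s_i, s_j) \Big| \\
&\overset{(b)}{\le} \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \mu(C_i) \Big(\mu(C_j) - \frac{|N_j|}{n}\Big) \Big| \\
&\quad + \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \frac{|N_j|}{n} \Big(\mu(C_i) - \frac{|N_i|}{n}\Big) \Big| \\
&\quad + \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big)\, \frac{|N_i|}{n} \frac{|N_j|}{n} - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} l(A_{p_s}, s_i, s_j) \Big| \\
&\overset{(c)}{\le} B \Big( \sum_{j=1}^{K} \Big| \mu(C_j) - \frac{|N_j|}{n} \Big| + \sum_{i=1}^{K} \Big| \mu(C_i) - \frac{|N_i|}{n} \Big| \Big) \\
&\quad + \frac{1}{n^2} \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{s_o \in N_i} \sum_{s_l \in N_j} \max_{z \in C_i} \max_{z' \in C_j} |l(A_{p_s}, z, z') - l(A_{p_s}, s_o, s_l)| \Big| \\
&\overset{(d)}{\le} \epsilon(p_s) + 2B \sum_{i=1}^{K} \Big| \frac{|N_i|}{n} - \mu(C_i) \Big| \\
&\overset{(e)}{\le} \epsilon(p_s) + 2B \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.
\end{aligned}
$$

Inequalities (a) and (b) are due to the triangle inequality, (c) uses the fact that $l$ is bounded by $B$, that
$\sum_{i=1}^{K} \mu(C_i) = 1$ by definition of a multinomial random variable and that $\sum_{j=1}^{K} \frac{|N_j|}{n} = 1$ by definition
of the $N_j$. Lastly, (d) is due to the hypothesis of robustness (Equation 1) and (e) to the application of
Proposition 1. □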

The previous bound depends on $K$ which is given by the cover chosen for $Z$. If for any $K$, the associated
$\epsilon(\cdot)$ is a constant (i.e. $\epsilon_K(s) = \epsilon_K$) for any $s$, we can prove a bound holding uniformly for all $K$:

$$|\mathcal{L}(A_{p_s}) - l_{emp}(A_{p_s})| \le \inf_{K \ge 1} \left[ \epsilon_K + 2B \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}} \right].$$

This also gives an insight into the objective of
any robust algorithm: according to a partition of the labeled input space, given two regions, minimize the
maximum loss over pairs of examples belonging to each region.

For triplet based metric learning algorithms, by following the definition of robustness given by Equation 2
and straightforwardly adapting the losses to triplets such that they output zero for non-admissible ones,
Theorem 1 can be easily extended to obtain the following generalization bound:

$$|\mathcal{L}(A_{trip_s}) - l_{emp}(A_{trip_s})| \le \epsilon(trip_s) + 3B \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \quad (4)$$
3.2 Pseudo-robustness



The previous study requires the robustness property to be true for every learning pair. We show, with the 
following definition, that it is possible to relax the robustness to be true for only a subpart of the sample 
and yet be able to derive generalization guarantees. 

Definition 3 An algorithm $A$ is $(K, \epsilon(\cdot), \hat{p}_n(\cdot))$ pseudo-robust for $K \in \mathbb{N}$, $\epsilon(\cdot) : (Z \times Z)^n \to \mathbb{R}$ and
$\hat{p}_n(\cdot) : (Z \times Z)^n \to \{1, \ldots, n^2\}$, if $Z$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$, such
that for all $s \in Z^n$ IID from $\mu$, there exists a subset of training pairs $\hat{p}_s \subseteq p_s$, with $|\hat{p}_s| = \hat{p}_n(p_s)$,
such that the following holds:

$\forall (s_1, s_2) \in \hat{p}_s$, $\forall z_1, z_2 \in Z$, $\forall i, j = 1, \ldots, K$: if $s_1, z_1 \in C_i$ and $s_2, z_2 \in C_j$ then

$$|l(A_{p_s}, s_1, s_2) - l(A_{p_s}, z_1, z_2)| \le \epsilon(p_s). \quad (5)$$



We can easily observe that $(K, \epsilon(\cdot))$-robust is equivalent to $(K, \epsilon(\cdot), n^2)$ pseudo-robust. The following
theorem illustrates the generalization guarantees associated to the pseudo-robustness property.

Theorem 2 If a learning algorithm $A$ is $(K, \epsilon(\cdot), \hat{p}_n(\cdot))$ pseudo-robust and the training pairs $p_s$ come from a
sample generated by $n$ IID draws from $\mu$, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:

$$|\mathcal{L}(A_{p_s}) - l_{emp}(A_{p_s})| \le \frac{\hat{p}_n(p_s)}{n^2}\, \epsilon(p_s) + B \left( \frac{n^2 - \hat{p}_n(p_s)}{n^2} + 2 \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}} \right).$$



The proof is similar to that of Theorem 1 and is given in Appendix A.1.
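The bound of Theorem 2 is easy to evaluate numerically, which also makes the equivalence with plain robustness visible: with $\hat{p}_n(p_s) = n^2$ it reduces to the bound of Theorem 1. The following sketch uses hypothetical names and arbitrary example values.

```python
import math

def pseudo_robust_bound(eps, B, K, n, delta, p_hat):
    """Right-hand side of Theorem 2, where p_hat is the number of robust training pairs."""
    frac = p_hat / n ** 2                       # fraction of pairs satisfying Equation (5)
    sampling = 2 * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)
    return frac * eps + B * ((1 - frac) + sampling)

# Assumed example values: eps = 0.1, B = 1, K = 5, n = 1000, delta = 0.05.
all_pairs = pseudo_robust_bound(0.1, 1.0, 5, 1000, 0.05, p_hat=1000 ** 2)
half_pairs = pseudo_robust_bound(0.1, 1.0, 5, 1000, 0.05, p_hat=1000 ** 2 // 2)

# With every pair robust, the value coincides with eps + 2B sqrt(...).
theorem1 = 0.1 + 2 * 1.0 * math.sqrt((2 * 5 * math.log(2) + 2 * math.log(1 / 0.05)) / 1000)
```

Each pair outside $\hat{p}_s$ is charged the worst-case loss $B$ instead of $\epsilon(p_s)$, so the bound degrades linearly in the fraction of non-robust pairs.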

The notion of pseudo-robustness characterizes a situation that often occurs in metric learning: it is 
sometimes difficult to optimize the metric over all the possible pairs. This theorem shows that it suffices 
to have a property of robustness over only a subset of the possible pairs to have generalization guarantees. 
Moreover, it also gives an insight into the behavior of metric learning approaches aiming at learning a
distance to be plugged in a k-nearest neighbor classifier such as LMNN [6]. These methods do not optimize
the distance according to all possible pairs, but only according to the nearest neighbors of the same class
and some pairs of different classes. According to the previous theorem, this principle is well-founded provided
that the robustness property is fulfilled for some of the pairs used to optimize the metric. Finally, note
that this notion of pseudo-robustness can also be easily adapted to triplet based metric learning.



4 Necessity of Robustness 

We prove here that a notion of weak robustness is actually necessary and sufficient to generalize in a metric
learning setup. This result is based on an asymptotic analysis following the work of Xu and Mannor [18].
We consider pairs of instances coming from an increasing sample of training instances $s = (s_1, s_2, \ldots)$ and
from a sample of test instances $t = (t_1, t_2, \ldots)$ such that both samples are assumed to be drawn IID from
a distribution $\mu$. We use $s(n)$ and $t(n)$ to denote the first $n$ examples of the two samples respectively,
while $s^*$ denotes a fixed sequence of examples.

We use $L(f, p_{t(n)}) = \frac{1}{n^2} \sum_{(s_i, s_j) \in p_{t(n)}} l(f, s_i, s_j)$ to refer to the average loss given a set of pairs for any
learned metric $f$, and $\mathcal{L}(f) = \mathbb{E}_{z, z' \sim \mu}\, l(f, z, z')$ for the expected loss.

We first define a notion of generalizability for metric learning. 

Definition 4 1. Given a training pair set $p_{s^*}$ coming from a sequence of examples $s^*$, a metric learning
method $A$ generalizes w.r.t. $p_{s^*}$ if $\lim_n |\mathcal{L}(A_{p_{s^*(n)}}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| = 0$.

2. A learning method $A$ generalizes with probability 1 if it generalizes with respect to the pairs $p_s$ of
almost all samples $s$ IID from $\mu$.

Note this notion of generalizability implies convergence in mean. We then introduce the notion of weak 
robustness for metric learning. 
Definition 5 1. Given a set of training pairs $p_{s^*}$ coming from a sequence of examples $s^*$, a metric
learning method $A$ is weakly robust with respect to $p_{s^*}$ if there exists a sequence of $\{D_n \subseteq Z^n\}$ such
that $\Pr(t(n) \in D_n) \to 1$ and

$$\lim_n \Big\{ \max_{\hat{s}(n) \in D_n} |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \Big\} = 0.$$

2. A learning method $A$ is almost surely weakly robust if it is weakly robust w.r.t. almost all $s$.

The definition of robustness requires the labeled sample space to be partitioned into disjoint subsets such
that if some instances of pairs of train/test examples belong to the same partition, then they have similar
loss. Weak robustness is a generalization of this notion where we consider the average loss of testing and
training pairs: if for a large (in the probabilistic sense) subset of data, the testing loss is close to the
training loss, then the algorithm is weakly robust. From Proposition 1, we can see that if for any fixed
$\epsilon > 0$ there exists $K$ such that an algorithm $A$ is $(K, \epsilon)$ robust, then $A$ is weakly robust. We now give the
main result of this section about the necessity of robustness.

Theorem 3 Given a fixed sequence of training examples $s^*$, a metric learning method $A$ generalizes w.r.t.
$p_{s^*}$ if and only if it is weakly robust w.r.t. $p_{s^*}$.

Proof Following [18], the sufficiency is obtained by the fact that the testing pairs are obtained from a
sample $t(n)$ constituted of $n$ IID instances. We give the proof in Appendix A.2.

For the necessity, we need the following lemma which is a direct adaptation of a result introduced in [18]
(Lemma 2). We provide the proof in Appendix A.3 for the sake of completeness.

Lemma 1 Given $s^*$, if a learning method is not weakly robust w.r.t. $p_{s^*}$, there exist $\epsilon^*, \delta^* > 0$ such that
the following holds for infinitely many $n$:

$$\Pr\big(|L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \ge \epsilon^*\big) \ge \delta^*. \quad (6)$$

Now, recall that $l$ is positive and uniformly bounded by $B$, thus by the McDiarmid inequality (recalled
in Appendix A.4) we have that for any $\epsilon, \delta > 0$ there exists an index $n^*$ such that for any $n > n^*$, with
probability at least $1 - \delta$, we have $\big| \frac{1}{n^2} \sum_{(t_i, t_j) \in p_{t(n)}} l(A_{p_{s^*(n)}}, t_i, t_j) - \mathcal{L}(A_{p_{s^*(n)}}) \big| \le \epsilon$. This implies the
convergence $L(A_{p_{s^*(n)}}, p_{t(n)}) - \mathcal{L}(A_{p_{s^*(n)}}) \xrightarrow{\Pr} 0$, and thus from a given index:

$$|L(A_{p_{s^*(n)}}, p_{t(n)}) - \mathcal{L}(A_{p_{s^*(n)}})| \le \frac{\epsilon^*}{2}. \quad (7)$$

Now, by contradiction, suppose algorithm $A$ is not weakly robust. Lemma 1 implies that Equation 6 holds for
infinitely many $n$. This combined with Equation 7 implies, by the triangle inequality, that for infinitely many $n$:

$$|\mathcal{L}(A_{p_{s^*(n)}}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \ge \frac{\epsilon^*}{2},$$

which means $A$ does not generalize, thus the necessity of weak robustness is established. □

The following corollary follows immediately from Theorem 3.

Corollary 1 A metric learning method A generalizes with probability 1 if and only if it is almost surely 
weakly robust. 



5 Examples of Robust Metric Learning Algorithms 

We first restrict our attention to Mahalanobis distance learning algorithms of the following form:

$$\min_{M \succeq 0} \; c\|M\| + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M, x_i, x_j)]\big), \quad (8)$$

where $s_i = (x_i, y_i)$, $s_j = (x_j, y_j)$, $y_{ij} = 1$ if $y_i = y_j$ and $-1$ otherwise, $f(M, x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
is the Mahalanobis distance parameterized by the $d \times d$ PSD matrix $M$, $\|\cdot\|$ some matrix norm and $c$
a regularization parameter. The loss function $l(f, s_i, s_j) = g(y_{ij}[1 - f(M, x_i, x_j)])$ outputs a small value
when its input is large positive and a large value when it is large negative. We assume $g$ to be nonnegative
and Lipschitz continuous with Lipschitz constant $U$. Lastly, $g_0 = \sup_{s_i, s_j} g(y_{ij}[1 - f(0, x_i, x_j)])$ is the
largest loss when $M$ is 0.
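The objective of Equation (8) can be written down directly. The sketch below is a plain evaluation of the objective, not a solver, and it makes two assumptions that are not fixed by the text: $g$ is taken to be the hinge function $g(t) = \max(0, 1 - t)$ (nonnegative and Lipschitz with $U = 1$, for which $g_0 = 2$), and the regularizer is the Frobenius norm. All function names are hypothetical.

```python
def mahalanobis(M, x, x2):
    """f(M, x, x') = (x - x')^T M (x - x')."""
    d = [a - b for a, b in zip(x, x2)]
    return sum(d[i] * M[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))

def hinge(t):
    """Assumed choice of g: nonnegative and 1-Lipschitz."""
    return max(0.0, 1.0 - t)

def frobenius(M):
    return sum(v * v for row in M for v in row) ** 0.5

def objective(M, sample, c, g=hinge, norm=frobenius):
    """c * ||M|| + (1/n^2) * sum over all pairs of g(y_ij * [1 - f(M, x_i, x_j)])."""
    n = len(sample)
    total = 0.0
    for xi, yi in sample:
        for xj, yj in sample:
            y_ij = 1 if yi == yj else -1
            total += g(y_ij * (1.0 - mahalanobis(M, xi, xj)))
    return c * norm(M) + total / (n * n)

sample = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 1.0), 1)]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

at_zero = objective(zero, sample, c=0.1)        # no regularization cost, loss only
at_identity = objective(identity, sample, c=0.1)
```

The value at $M = 0$ is at most $g_0$, which is the comparison point used in the robustness proofs that follow.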

To prove the robustness of (8), we will need the following theorem, which essentially says that if a metric
learning algorithm achieves approximately the same testing loss for testing pairs that are close to each
other, then it is robust.

Theorem 4 Fix $\gamma > 0$ and a metric $\rho$ of $Z$. Suppose $A$ satisfies

$$|l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z_1', z_2')| \le \epsilon(p_s), \quad \forall z_1, z_2, z_1', z_2' : z_1, z_2 \in s,\ \rho(z_1, z_1') \le \gamma,\ \rho(z_2, z_2') \le \gamma$$

and $\mathcal{N}(\gamma/2, Z, \rho) < \infty$. Then $A$ is $(\mathcal{N}(\gamma/2, Z, \rho), \epsilon(p_s))$-robust.

Proof By definition of covering number, we can partition $X$ in $\mathcal{N}(\gamma/2, X, \rho)$ subsets such that each
subset has a diameter less than or equal to $\gamma$. Furthermore, since $Y$ is a finite set, we can partition $Z$ into
$|Y|\,\mathcal{N}(\gamma/2, X, \rho)$ subsets $\{C_i\}$ such that $z_1, z_1' \in C_i \Rightarrow \rho(z_1, z_1') \le \gamma$. Therefore,

$$|l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z_1', z_2')| \le \epsilon(p_s), \quad \forall z_1, z_2, z_1', z_2' : z_1, z_2 \in s,\ \rho(z_1, z_1') \le \gamma,\ \rho(z_2, z_2') \le \gamma$$

implies $z_1, z_2 \in s$, $z_1, z_1' \in C_i$, $z_2, z_2' \in C_j \Rightarrow |l(A_{p_s}, z_1, z_2) - l(A_{p_s}, z_1', z_2')| \le \epsilon(p_s)$, which establishes the
theorem. □

We now prove the robustness of (8) when $\|M\|$ is the Frobenius norm.



Example 1 (Frobenius norm) Algorithm (8) with $\|M\| = \|M\|_F = \sqrt{\sum_{i=1}^{d} \sum_{j=1}^{d} m_{ij}^2}$ is $\big(|Y|\,\mathcal{N}(\gamma/2, X, \|\cdot\|_2), \frac{8UR\gamma g_0}{c}\big)$-robust.

Proof Let $M^*$ be the solution given training data $p_s$. Thus, due to optimality of $M^*$, we have

$$c\|M^*\|_F + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M^*, x_i, x_j)]\big) \le c\|0\|_F + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(0, x_i, x_j)]\big) \le g_0$$

and thus $\|M^*\|_F \le g_0/c$. We can partition $Z$ as $|Y|\,\mathcal{N}(\gamma/2, X, \|\cdot\|_2)$ sets, such that if $z$ and $z'$ belong
to the same set, then $y = y'$ and $\|x - x'\|_2 \le \gamma$. Now, for $z_1, z_2, z_1', z_2' \in Z$, if $y_1 = y_1'$, $\|x_1 - x_1'\|_2 \le \gamma$,
$y_2 = y_2'$ and $\|x_2 - x_2'\|_2 \le \gamma$, then:

$$
\begin{aligned}
&\big|g(y_{12}[1 - f(M^*, x_1, x_2)]) - g(y_{12}'[1 - f(M^*, x_1', x_2')])\big| \\
&\le U \big|(x_1 - x_2)^T M^* (x_1 - x_2) - (x_1' - x_2')^T M^* (x_1' - x_2')\big| \\
&= U \big|(x_1 - x_2)^T M^* (x_1 - x_2) - (x_1 - x_2)^T M^* (x_1' - x_2') \\
&\qquad + (x_1 - x_2)^T M^* (x_1' - x_2') - (x_1' - x_2')^T M^* (x_1' - x_2')\big| \\
&= U \big|(x_1 - x_2)^T M^* \big(x_1 - x_2 - (x_1' - x_2')\big) + \big(x_1 - x_2 - (x_1' - x_2')\big)^T M^* (x_1' - x_2')\big| \\
&\le U \big( |(x_1 - x_2)^T M^* (x_1 - x_1')| + |(x_1 - x_2)^T M^* (x_2' - x_2)| \\
&\qquad + |(x_1 - x_1')^T M^* (x_1' - x_2')| + |(x_2' - x_2)^T M^* (x_1' - x_2')| \big) \\
&\le U \big( \|x_1 - x_2\|_2 \|M^*\|_F \|x_1 - x_1'\|_2 + \|x_1 - x_2\|_2 \|M^*\|_F \|x_2' - x_2\|_2 \\
&\qquad + \|x_1 - x_1'\|_2 \|M^*\|_F \|x_1' - x_2'\|_2 + \|x_2' - x_2\|_2 \|M^*\|_F \|x_1' - x_2'\|_2 \big) \\
&\le \frac{8UR\gamma g_0}{c}.
\end{aligned}
$$

Hence, the example holds by Theorem 4. □
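The core inequality of this proof, $|f(M, x_1, x_2) - f(M, x_1', x_2')| \le 8R\gamma\|M\|_F$ for points of norm at most $R$ with $\|x_1 - x_1'\|_2 \le \gamma$ and $\|x_2 - x_2'\|_2 \le \gamma$, can be sanity-checked numerically. The sketch below is an illustration under assumed values ($d = 3$, $R = 1$, $\gamma = 0.1$, a random PSD matrix); multiplying by $U$ and using $\|M^*\|_F \le g_0/c$ then gives the constant of Example 1. The helper names are hypothetical.

```python
import math
import random

def quad(M, u, v):
    """Bilinear form u^T M v."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def clip_to_ball(x, R):
    """Project x onto the Euclidean ball of radius R."""
    norm = math.sqrt(sum(a * a for a in x))
    return x if norm <= R else [a * R / norm for a in x]

def random_point(rng, d, R):
    return clip_to_ball([rng.uniform(-R, R) for _ in range(d)], R)

def nearby(rng, x, gamma, R):
    """A point of the ball within Euclidean distance gamma of x."""
    step = [rng.gauss(0, 1) for _ in x]
    scale = gamma * rng.random() / math.sqrt(sum(s * s for s in step))
    # Projection onto a convex set does not increase the distance to x (x is in the ball).
    return clip_to_ball([a + scale * s for a, s in zip(x, step)], R)

rng = random.Random(0)
d, R, gamma = 3, 1.0, 0.1
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
M = [[sum(A[k][i] * A[k][j] for k in range(d)) for j in range(d)] for i in range(d)]  # PSD: A^T A
fro = math.sqrt(sum(v * v for row in M for v in row))

worst = 0.0
for _ in range(1000):
    x1, x2 = random_point(rng, d, R), random_point(rng, d, R)
    x1p, x2p = nearby(rng, x1, gamma, R), nearby(rng, x2, gamma, R)
    u = [a - b for a, b in zip(x1, x2)]
    up = [a - b for a, b in zip(x1p, x2p)]
    worst = max(worst, abs(quad(M, u, u) - quad(M, up, up)))

bound = 8 * R * gamma * fro
```

Random sampling only probes the inequality; it is the Cauchy-Schwarz argument in the proof that guarantees it for all admissible points.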

Note that for the special case of Example 1, a generalization bound (with same order of convergence
rate) based on uniform stability was derived in [14]. However, it is known that sparse algorithms are
not stable and thus stability-based analysis fails to assess the generalization ability of recent sparse
metric learning approaches [...]. The key advantage of robustness over stability is that it can accommodate
arbitrary p-norms (or even any regularizer which is bounded below by some p-norm), thanks to
the equivalence of norms. To illustrate this, we show the robustness when $\|M\|$ is either the $\ell_1$ norm (used
in [...]) which promotes sparsity at the component level, or the $\ell_{2,1}$ norm (used in [...]), which is particularly
interesting in the context of Mahalanobis distance learning since it induces group sparsity at the
column/row level.² The proofs are reminiscent of that of Example 1 and can be found in Appendices A.5
and A.6.

2 In this case, the linear projection space of the data induced by the learned Mahalanobis distance is of lower dimension
than the original space, allowing more efficient computations and smaller storage size.
Example 2 ($\ell_1$ norm) Algorithm (8) with $\|M\| = \|M\|_1$ is $\big(|Y|\,\mathcal{N}(\gamma, X, \|\cdot\|_1), \frac{8UR\gamma g_0}{c}\big)$-robust.

Example 3 ($\ell_{2,1}$ norm) Consider Algorithm (8) with $\|M\| = \|M\|_{2,1} = \sum_{i=1}^{d} \|m^i\|_2$, where $m^i$ is the
$i$-th column of $M$. This algorithm is $\big(|Y|\,\mathcal{N}(\gamma, X, \|\cdot\|_2), \frac{8UR\gamma g_0}{c}\big)$-robust.

Some metric learning algorithms have kernelized versions, for instance [...]. In the following example we
show robustness for a kernelized formulation.

Example 4 (Kernelization) Consider the kernelized version of Algorithm (8):

$$\min_{M \succeq 0} \; c\|M\|_{\mathbb{H}} + \frac{1}{n^2} \sum_{(s_i, s_j) \in p_s} g\big(y_{ij}[1 - f(M, \phi(x_i), \phi(x_j))]\big), \quad (9)$$

where $\phi(\cdot)$ is a feature mapping to a kernel space $\mathbb{H}$, $\|\cdot\|_{\mathbb{H}}$ the norm function of $\mathbb{H}$ and $k(\cdot, \cdot)$ the kernel
function. Consider a cover of $X$ by $\|\cdot\|_2$ ($X$ being compact) and let $f_{\mathbb{H}}(\gamma) = \max_{a, b \in X,\ \|a - b\|_2 \le \gamma} \big(k(a, a) +
k(b, b) - 2k(a, b)\big)$ and $B_\gamma$ = [...]. If the kernel function is continuous, $B_\gamma$ and $f_{\mathbb{H}}$ are finite
for any $\gamma > 0$ and thus Algorithm (9) is $\big(|Y|\,\mathcal{N}(\gamma, X, \|\cdot\|_2), [...]\big)$-robust.

The proof is given in Appendix A.7.

Remark 1 We can easily prove similar results for other forms of metrics using the same technique.
For instance, when the function is a bilinear similarity $f(M, x_i, x_j) = x_i^T M x_j$ where $M$ is usually not
constrained to be PSD [...], we can improve the robustness to $2UR\gamma g_0/c$.

Remark 2 Using triplet-based robustness (Equation 2), we can for instance show the robustness of two
popular triplet-based metric learning approaches [...] for which no generalization guarantees were known
(to the best of our knowledge). These algorithms have the following form:

$$\min_{M \succeq 0} \; c\|M\| + \frac{1}{|trip_s|} \sum_{(s_i, s_j, s_k) \in trip_s} \big[1 - (x_i - x_k)^T M (x_i - x_k) + (x_i - x_j)^T M (x_i - x_j)\big]_+,$$

where $\|M\| = \|M\|_F$ in [...] and $\|M\| = \|M\|_{1,2}$ in [...]. These methods are $\big(\mathcal{N}(\gamma, Z, \|\cdot\|_2), \frac{16UR\gamma g_0}{c}\big)$-robust
(by using the same proof technique as in Examples 1 and 3). The additional factor 2 comes from the use
of triplets instead of pairs.



6 Conclusion 



We proposed a new theoretical framework for evaluating the generalization ability of metric learning
based on the notion of algorithmic robustness originally introduced in [18]. We showed that a weak notion
of robustness characterizes the generalizability of metric learning algorithms, justifying that robustness is
fundamental for such algorithms. This framework allows us to derive generalization bounds for a large
class of algorithms with different regularizations, such as sparsity-inducing norms, making the approach
more powerful than the (few) existing frameworks. Moreover, almost no algorithm-specific argument is
needed to derive these bounds, which explains why they are often similar. Natural perspectives arise when
considering different settings. For example, some algorithms use both pair and triplet based information
as input, such as [...]. Other future work could include studying even more general loss functions and
regularizers (such as the LogDet divergence used in [...]), unsupervised/semi-supervised methods or
domain adaptation. Being able to characterize the generalization ability of metric learning directly with
the kind of classifier using the metric, like k-NN, is also an interesting and challenging direction.
A Appendix



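A.1 Proof of Theorem 2 (pseudo-robustness)

Proof From the proof of Theorem 1, we can easily deduce that:

$$
\begin{aligned}
|\mathcal{L}(A_{p_s}) - l_{emp}(A_{p_s})| &\le 2B \sum_{i=1}^{K} \Big| \frac{|N_i|}{n} - \mu(C_i) \Big| \\
&\quad + \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{E}_{z_1, z_2 \sim \mu}\big(l(A_{p_s}, z_1, z_2) \mid z_1 \in C_i, z_2 \in C_j\big) \frac{|N_i|}{n} \frac{|N_j|}{n} - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} l(A_{p_s}, s_i, s_j) \Big|.
\end{aligned}
$$

Then, we have

$$
\begin{aligned}
&\le 2B \sum_{i=1}^{K} \Big| \frac{|N_i|}{n} - \mu(C_i) \Big| \\
&\quad + \frac{1}{n^2} \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{(s_o, s_l) \in \hat{p}_s} \sum_{s_o \in N_i} \sum_{s_l \in N_j} \max_{z \in C_i} \max_{z' \in C_j} |l(A_{p_s}, z, z') - l(A_{p_s}, s_o, s_l)| \Big| \\
&\quad + \frac{1}{n^2} \Big| \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{(s_o, s_l) \notin \hat{p}_s} \sum_{s_o \in N_i} \sum_{s_l \in N_j} \max_{z \in C_i} \max_{z' \in C_j} |l(A_{p_s}, z, z') - l(A_{p_s}, s_o, s_l)| \Big| \\
&\le \frac{\hat{p}_n(p_s)}{n^2}\, \epsilon(p_s) + B \left( \frac{n^2 - \hat{p}_n(p_s)}{n^2} + 2 \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}} \right).
\end{aligned}
$$

The second inequality is obtained by the triangle inequality, the last one is obtained by the application of
Proposition 1, the hypothesis of pseudo-robustness and the fact that $l$ is positive and bounded by $B$ and
thus $|l(A_{p_s}, z, z') - l(A_{p_s}, s_o, s_l)| \le B$. □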



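A.2 Proof of sufficiency of Theorem 3

Proof The proof of sufficiency corresponds to the first part of the proof of Theorem 8 in [18]. When $A$
is weakly robust there exists a sequence $\{D_n\}$ such that for any $\delta, \epsilon > 0$ there exists $N(\delta, \epsilon)$ such that for
all $n > N(\delta, \epsilon)$, $\Pr(t(n) \in D_n) > 1 - \delta$ and

$$\max_{\hat{s}(n) \in D_n} |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| < \epsilon. \quad (10)$$

Therefore for any $n > N(\delta, \epsilon)$,

$$
\begin{aligned}
&|\mathcal{L}(A_{p_{s^*(n)}}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \\
&= \big|\mathbb{E}_{t(n)}\big(L(A_{p_{s^*(n)}}, p_{t(n)})\big) - L(A_{p_{s^*(n)}}, p_{s^*(n)})\big| \\
&= \big|\Pr(t(n) \notin D_n)\, \mathbb{E}\big(L(A_{p_{s^*(n)}}, p_{t(n)}) \mid t(n) \notin D_n\big) \\
&\qquad + \Pr(t(n) \in D_n)\, \mathbb{E}\big(L(A_{p_{s^*(n)}}, p_{t(n)}) \mid t(n) \in D_n\big) - L(A_{p_{s^*(n)}}, p_{s^*(n)})\big| \\
&\le \Pr(t(n) \notin D_n)\, \big|\mathbb{E}\big(L(A_{p_{s^*(n)}}, p_{t(n)}) \mid t(n) \notin D_n\big) - L(A_{p_{s^*(n)}}, p_{s^*(n)})\big| \\
&\qquad + \Pr(t(n) \in D_n)\, \big|\mathbb{E}\big(L(A_{p_{s^*(n)}}, p_{t(n)}) \mid t(n) \in D_n\big) - L(A_{p_{s^*(n)}}, p_{s^*(n)})\big| \\
&\le \delta B + \max_{\hat{s}(n) \in D_n} |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \\
&\le \delta B + \epsilon.
\end{aligned}
$$

The first equality holds because the testing sample $t(n)$ consists of $n$ instances IID from $\mu$. The second
equality is obtained by conditional expectation. The next inequality uses the positiveness and the upper
bound $B$ of the loss function. Finally, we apply Equation 10. We thus conclude that $A$ generalizes for $p_{s^*}$
because $\epsilon$ and $\delta$ can be chosen arbitrarily. □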
A.3 Proof of Lemma 1

Proof This proof follows exactly the same principle as the proof of Lemma 2 from [18]. By contradiction,
assume $\epsilon^*$ and $\delta^*$ do not exist. Let $\epsilon_v = \delta_v = 1/v$ for $v = 1, 2, \ldots$, then there exists a non-decreasing sequence
$\{N(v)\}_{v=1}^{\infty}$ such that for all $v$, if $n \ge N(v)$ then $\Pr\big(|L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \ge \epsilon_v\big) < \delta_v$. For
each $n$ we define

$$D_n^v = \big\{ \hat{s}(n) : |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| < \epsilon_v \big\}.$$

For each $n \ge N(v)$ we have

$$\Pr(t(n) \in D_n^v) = 1 - \Pr\big(|L(A_{p_{s^*(n)}}, p_{t(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \ge \epsilon_v\big) > 1 - \delta_v.$$

For $n \ge N(1)$, define $D_n = D_n^{v(n)}$, where $v(n) = \max(v \mid N(v) \le n;\ v \le n)$. Thus for all $n \ge N(1)$ we have
$\Pr(t(n) \in D_n) > 1 - \delta_{v(n)}$ and

$$\sup_{\hat{s}(n) \in D_n} |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| < \epsilon_{v(n)}.$$

Note that $v(n)$ tends to infinity; it follows that $\delta_{v(n)} \to 0$ and $\epsilon_{v(n)} \to 0$. Therefore, $\Pr(t(n) \in D_n) \to 1$
and

$$\lim_n \Big\{ \sup_{\hat{s}(n) \in D_n} |L(A_{p_{s^*(n)}}, p_{\hat{s}(n)}) - L(A_{p_{s^*(n)}}, p_{s^*(n)})| \Big\} = 0.$$

That is, $A$ is weakly robust w.r.t. $p_{s^*}$, which is the desired contradiction. □



A. 4 Mc Diarmid inequality 



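Let $X_1, \ldots, X_n$ be $n$ independent random variables taking values in $\mathcal{X}$ and let $Z = f(X_1, \ldots, X_n)$. If for
each $1 \leq i \leq n$, there exists a constant $c_i$ such that
\[
\sup_{x_1, \ldots, x_n, x_i' \in \mathcal{X}} \left| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \right| \leq c_i, \quad \forall\, 1 \leq i \leq n,
\]
then for any $\epsilon > 0$,
\[
\Pr\big[ |Z - \mathbb{E}[Z]| \geq \epsilon \big] \leq 2 \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right).
\]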
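As an informal sanity check of the inequality, the following sketch compares the bound with a Monte Carlo estimate of the tail probability. The setting is an assumption chosen for illustration and is not taken from the paper: $Z$ is the mean of $n$ independent Uniform(0,1) variables, so changing one coordinate moves $Z$ by at most $c_i = 1/n$. The function names are hypothetical.

```python
import math
import random

def mcdiarmid_bound(eps, c):
    """Two-sided McDiarmid bound: 2 * exp(-2 * eps^2 / sum_i c_i^2)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 / sum(ci ** 2 for ci in c))

def empirical_tail(n, eps, trials, seed=0):
    """Monte Carlo estimate of Pr[|Z - E[Z]| >= eps] for Z = mean of n Uniform(0,1) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = sum(rng.random() for _ in range(n)) / n
        if abs(z - 0.5) >= eps:  # E[Z] = 0.5 for Uniform(0,1)
            hits += 1
    return hits / trials

n, eps = 50, 0.1
c = [1.0 / n] * n  # bounded-difference constants of the sample mean
bound = mcdiarmid_bound(eps, c)
tail = empirical_tail(n, eps, trials=20000)
print(f"empirical tail = {tail:.4f}, McDiarmid bound = {bound:.4f}")
```

In this setting the bound evaluates to $2e^{-1}$, and the simulated tail probability should fall well below it, which illustrates that the inequality is valid but not tight here.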



A.5 Proof of Example 2 ($\ell_1$ norm)

Proof Let $M^*$ be the solution given training data $p_s$. Due to optimality of $M^*$, we have $\|M^*\|_1 \leq g_0/c$.
We can partition $\mathcal{Z}$ as $|\mathcal{Y}|\,\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_1)$ sets, such that if $z$ and $z'$ belong to the same set, then $y = y'$
and $\|x - x'\|_1 \leq \gamma$. Now, for $z_1, z_2, z_1', z_2' \in \mathcal{Z}$, if $y_1 = y_1'$, $\|x_1 - x_1'\|_1 \leq \gamma$, $y_2 = y_2'$ and $\|x_2 - x_2'\|_1 \leq \gamma$,
then:
\[
\begin{aligned}
&\left| g(y_{12}[1 - f(M^*, x_1, x_2)]) - g(y_{12}'[1 - f(M^*, x_1', x_2')]) \right| \\
&\quad \leq U \big( |(x_1 - x_2)^T M^* (x_1 - x_1')| + |(x_1 - x_2)^T M^* (x_2' - x_2)| \\
&\qquad\quad + |(x_1 - x_1')^T M^* (x_1' - x_2')| + |(x_2' - x_2)^T M^* (x_1' - x_2')| \big) \\
&\quad \leq U \big( \|x_1 - x_2\|_\infty \|M^*\|_1 \|x_1 - x_1'\|_1 + \|x_1 - x_2\|_\infty \|M^*\|_1 \|x_2' - x_2\|_1 \\
&\qquad\quad + \|x_1 - x_1'\|_1 \|M^*\|_1 \|x_1' - x_2'\|_\infty + \|x_2' - x_2\|_1 \|M^*\|_1 \|x_1' - x_2'\|_\infty \big) \\
&\quad \leq \frac{8 U R \gamma g_0}{c}.
\end{aligned}
\]
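The key step in this proof bounds each bilinear term by $|u^T M v| \leq \|u\|_\infty \|M\|_1 \|v\|_1$. The sketch below checks this inequality on random inputs. It assumes that $\|M\|_1$ denotes the entrywise $\ell_1$ norm (the sum of absolute values of the entries), which is one possible reading of the notation; the helper names are hypothetical.

```python
import random

def bilinear(u, M, v):
    """Return u^T M v for plain Python lists."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def norm_inf(x):
    """Largest absolute coordinate of x."""
    return max(abs(t) for t in x)

def norm_1(x):
    """Sum of absolute coordinates of x."""
    return sum(abs(t) for t in x)

def entrywise_l1(M):
    """Entrywise l1 norm of a matrix (assumed reading of ||M||_1)."""
    return sum(abs(t) for row in M for t in row)

def check_l1_bound(d=5, trials=1000, seed=0):
    """Return True if |u^T M v| <= ||u||_inf ||M||_1 ||v||_1 holds on every random draw."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.uniform(-1, 1) for _ in range(d)]
        v = [rng.uniform(-1, 1) for _ in range(d)]
        M = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
        lhs = abs(bilinear(u, M, v))
        rhs = norm_inf(u) * entrywise_l1(M) * norm_1(v)
        if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
            return False
    return True

print(check_l1_bound())
```

A random check of this kind does not replace the proof; it only guards against a misreading of which norm is paired with which vector.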



A.6 Proof of Example 3 ($\ell_{2,1}$ norm)

Proof Let $M^*$ be the solution given training data $p_s$. Due to optimality of $M^*$, we have $\|M^*\|_{2,1} \leq g_0/c$.
We can partition $\mathcal{Z}$ in the same way as in the proof of Example 1 and use the inequality $\|M^*\|_F \leq \|M^*\|_{2,1}$
(from Theorem 3 of Feng [21]) to derive the same bound:
\[
\begin{aligned}
&\left| g(y_{12}[1 - f(M^*, x_1, x_2)]) - g(y_{12}'[1 - f(M^*, x_1', x_2')]) \right| \\
&\quad \leq U \big( \|x_1 - x_2\|_2 \|M^*\|_F \|x_1 - x_1'\|_2 + \|x_1 - x_2\|_2 \|M^*\|_F \|x_2' - x_2\|_2 \\
&\qquad\quad + \|x_1 - x_1'\|_2 \|M^*\|_F \|x_1' - x_2'\|_2 + \|x_2' - x_2\|_2 \|M^*\|_F \|x_1' - x_2'\|_2 \big) \\
&\quad \leq U \big( \|x_1 - x_2\|_2 \|M^*\|_{2,1} \|x_1 - x_1'\|_2 + \|x_1 - x_2\|_2 \|M^*\|_{2,1} \|x_2' - x_2\|_2 \\
&\qquad\quad + \|x_1 - x_1'\|_2 \|M^*\|_{2,1} \|x_1' - x_2'\|_2 + \|x_2' - x_2\|_2 \|M^*\|_{2,1} \|x_1' - x_2'\|_2 \big) \\
&\quad \leq \frac{8 U R \gamma g_0}{c}.
\end{aligned}
\]
□
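The norm comparison $\|M\|_F \leq \|M\|_{2,1}$ used in this proof can be checked numerically. The sketch below assumes the convention that $\|M\|_{2,1}$ is the sum of the Euclidean norms of the columns of $M$; other texts sum over rows instead, and the inequality holds under either convention. The function names are hypothetical.

```python
import math
import random

def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(t * t for row in M for t in row))

def l21(M):
    """Sum of the Euclidean norms of the columns (assumed l2,1 convention)."""
    rows, cols = len(M), len(M[0])
    return sum(math.sqrt(sum(M[i][j] ** 2 for i in range(rows))) for j in range(cols))

def check_norm_order(d=4, trials=1000, seed=1):
    """Return True if ||M||_F <= ||M||_{2,1} holds on every random Gaussian matrix."""
    rng = random.Random(seed)
    for _ in range(trials):
        M = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
        if frobenius(M) > l21(M) + 1e-12:  # tolerance for floating-point rounding
            return False
    return True

print(check_norm_order())
```

For the 2 x 2 identity matrix, for instance, the Frobenius norm is $\sqrt{2}$ while the $\ell_{2,1}$ norm is 2, so the gap between the two norms can be substantial.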



A.7 Proof of Example 4 (Kernelization)

We assume $\mathbb{H}$ to be a Hilbert space with an inner product operator $\langle \cdot, \cdot \rangle$. The mapping $\phi(\cdot)$ is continuous
from $\mathcal{X}$ to $\mathbb{H}$. The norm $\|\cdot\|_{\mathbb{H}} : \mathbb{H} \to \mathbb{R}$ is defined as $\|w\|_{\mathbb{H}} = \sqrt{\langle w, w \rangle}$ for all $w \in \mathbb{H}$. For matrices, $\|M\|_{\mathbb{H}}$
is the entrywise norm obtained by considering a matrix as a vector, which corresponds to the Frobenius norm.
The kernel function is defined as $k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$.

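$B_\gamma$ and $f_{\mathbb{H}}(\gamma)$ are finite by the compactness of $\mathcal{X}$ and the continuity of $k(\cdot, \cdot)$. Let $M^*$ be the solution given
training data $p_s$. By the optimality of $M^*$ and using the same argument as in the other examples, we have
$\|M^*\|_{\mathbb{H}} \leq g_0/c$. Then, consider a partition of $\mathcal{Z}$ into $|\mathcal{Y}|\,\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$ disjoint subsets such that
if $(x_1, y_1)$ and $(x_2, y_2)$ belong to the same set then $y_1 = y_2$ and $\|x_1 - x_2\|_2 \leq \gamma$.

We have
\[
\begin{aligned}
&\left| g(y_{ij}[1 - f(M^*, \phi(x_1), \phi(x_2))]) - g(y_{ij}[1 - f(M^*, \phi(x_1'), \phi(x_2'))]) \right| \\
&\quad \leq U \big( |(\phi(x_1) - \phi(x_2))^T M^* (\phi(x_1) - \phi(x_1'))| + |(\phi(x_1) - \phi(x_2))^T M^* (\phi(x_2') - \phi(x_2))| \\
&\qquad\quad + |(\phi(x_1) - \phi(x_1'))^T M^* (\phi(x_1') - \phi(x_2'))| + |(\phi(x_2') - \phi(x_2))^T M^* (\phi(x_1') - \phi(x_2'))| \big) \\
&\quad \leq U \big( |\phi(x_1)^T M^* (\phi(x_1) - \phi(x_1'))| + |\phi(x_2)^T M^* (\phi(x_1) - \phi(x_1'))| \\
&\qquad\quad + |\phi(x_1)^T M^* (\phi(x_2') - \phi(x_2))| + |\phi(x_2)^T M^* (\phi(x_2') - \phi(x_2))| \\
&\qquad\quad + |(\phi(x_1) - \phi(x_1'))^T M^* \phi(x_1')| + |(\phi(x_1) - \phi(x_1'))^T M^* \phi(x_2')| \\
&\qquad\quad + |(\phi(x_2') - \phi(x_2))^T M^* \phi(x_1')| + |(\phi(x_2') - \phi(x_2))^T M^* \phi(x_2')| \big).
\end{aligned}
\tag{11}
\]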



Then, note that 



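\[
|\phi(x_1)^T M^* (\phi(x_1) - \phi(x_1'))| \leq \sqrt{\langle \phi(x_1), \phi(x_1) \rangle}\; \|M^*\|_{\mathbb{H}}\; \sqrt{\langle \phi(x_1) - \phi(x_1'), \phi(x_1) - \phi(x_1') \rangle}.
\]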



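Thus, by applying the same principle to all the terms on the right-hand side of inequality (11), we obtain:
\[
\left| g(y_{ij}[1 - f(M^*, \phi(x_1), \phi(x_2))]) - g(y_{ij}[1 - f(M^*, \phi(x_1'), \phi(x_2'))]) \right| \leq \frac{8 U B_\gamma \sqrt{f_{\mathbb{H}}(\gamma)}\, g_0}{c}.
\]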
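The per-term bound used in this proof only involves kernel evaluations, since $\langle \phi(a), \phi(a) \rangle = k(a, a)$ and $\|\phi(a) - \phi(b)\|^2 = k(a, a) - 2k(a, b) + k(b, b)$. The sketch below illustrates this on a kernel chosen purely for convenience, not taken from the paper: the homogeneous quadratic kernel $k(x, y) = (x^T y)^2$ on $\mathbb{R}^2$, whose explicit feature map is three-dimensional. The function names are hypothetical.

```python
import math
import random

def kernel(x, y):
    """Homogeneous quadratic kernel on R^2 (an illustrative choice)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map with <phi(x), phi(y)> = kernel(x, y)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def check_kernel_bound(trials=1000, seed=2):
    """Check |phi(a)^T M (phi(a) - phi(b))| <= sqrt(k(a,a)) ||M||_F sqrt(k(a,a) - 2k(a,b) + k(b,b))."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        M = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
        pa, pb = phi(a), phi(b)
        diff = [s - t for s, t in zip(pa, pb)]
        lhs = abs(sum(pa[i] * M[i][j] * diff[j] for i in range(3) for j in range(3)))
        fro = math.sqrt(sum(t * t for row in M for t in row))
        # Squared feature-space distance written with kernel evaluations only.
        dist_sq = max(0.0, kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b))
        rhs = math.sqrt(kernel(a, a)) * fro * math.sqrt(dist_sq)
        if lhs > rhs + 1e-9:  # tolerance for floating-point rounding
            return False
    return True

print(check_kernel_bound())
```

Writing the right-hand side with kernel evaluations only mirrors why the final bound can be stated without access to $\phi$ itself.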



References 

[1] M. Schultz and T. Joachims. Learning a Distance Metric from Relative Comparisons. In NIPS, 2003. 

[2] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In ICML, 
2004. 

[3] R. Rosales and G. Fung. Learning Sparse Metrics via Linear Programming. In KDD, pages 367-373, 
2006. 

[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In 
ICML, pages 209-216, 2007. 






[5] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online Metric Learning and Fast Similarity Search. 
In NIPS, pages 761-768, 2008. 

[6] K. Q. Weinberger and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor 
Classification. Journal of Machine Learning Research, 10:207-244, 2009. 

[7] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang. An Efficient Sparse Metric Learning in 
High-Dimensional Space via ℓ1-Penalized Log-Determinant Regularization. In ICML, 2009.

[8] Y. Ying, K. Huang, and C. Campbell. Sparse Metric Learning via Smooth Optimization. In NIPS, 
pages 2214-2222, 2009. 

[9] G. Chechik, U. Shalit, V. Sharma, and S. Bengio. An Online Algorithm for Large Scale Image 
Similarity Learning. In NIPS, pages 306-314, 2009. 

[10] A. M. Qamar. Generalized Cosine and Similarity Metrics: A supervised learning approach based on 
nearest-neighbors. PhD thesis, University of Grenoble, 2010. 

[11] U. Shalit, D. Weinshall, and G. Chechik. Online learning in the manifold of low-rank matrices. In 
NIPS, pages 2128-2136, 2010. 

[12] P. Kar and P. Jain. Similarity-based Learning via Data Driven Embeddings. In NIPS, 2011. 

[13] W. Bian and D. Tao. Learning a Distance Metric by Empirical Loss Minimization. In IJCAI, pages 
1186-1191, 2011. 

[14] R. Jin, S. Wang, and Y. Zhou. Regularized distance metric learning: Theory and algorithm. In NIPS, 
pages 862-870, 2009. 

[15] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 
2:499-526, 2002. 

[16] H. Xu, C. Caramanis, and S. Mannor. Sparse Algorithms Are Not Stable: A No-Free-Lunch Theorem. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):187-193, 2012. 

[17] H. Xu and S. Mannor. Robustness and generalization. In COLT, pages 503-515, 2010. 

[18] H. Xu and S. Mannor. Robustness and generalization. Machine Learning Journal, 86(3):391-423, 
2012. 

[19] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional spaces. American
Mathematical Society Translations, 17(series 2):277-364, 1961. 

[20] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer, 2000.

[21] B. Q. Feng. Equivalence constants for certain matrix norms. Linear Algebra and Its Applications,
374:247-253, 2003. 






